Optimal Stopping Times
Introduction
Let's say we're trying to detect positive cases in a series of samples. My use case for developing this method was detecting whether chunks of text contain some trait (e.g., talk of self-harm). Whether they do or don't can be described as a collection of coin flips \(y_i \sim \text{Bernoulli}(p_i),\ i=1,...,n\), where \(i\) indexes a chunk, \(y_i=1\) if and only if the trait is present in the chunk, and \(p_i\) is the probability that \(y_i=1\). The task is for a human checker to find all the \(\{i: y_i=1\}\), but alas \(n\) is so large we can't actually check them all.

The most recent strategy we've discussed is to:
1. rank the \(n\) chunks from most likely to contain the trait to least likely,
2. verify the most likely ones first, checking in decreasing order of likelihood,
3. stop at some point.
If we stop before checking them all, of course we might miss some \(y_i=1\) cases. The method proposed here will give an estimate of how many such cases we are likely to have missed for any stopping point.
Assume the ranking above is already done: the ranker tried to re-order the \(y_i\)s to put \(p_i\)s in monotonically decreasing order, \(p_1 \geq p_2 \geq ... \geq p_n\).
The method we propose makes a single key assumption[1][2]:
\[\begin{aligned} p_i &= \boxed{\text{expit}\left(a + b\cdot \frac{i}{n}\right)}\ ,\text{where expit}(x) := \frac{1}{1+e^{-x}}.\label{eq} \end{aligned}\]
The goal is to check enough chunks such that the expected number of cases among the unchecked chunks, i.e., the count of false negatives (\(\text{fn}\)), is sufficiently low. Formally, the goal is to select \(i^*\) satisfying
\[\begin{aligned} \text{fn}(i^*) := \sum_{i=i^*+1}^n \mathbb{E}(y_i) \leq t \end{aligned}\] for a desired threshold \(t\).
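To make this target concrete, here is a minimal R sketch of \(\text{fn}\) under the key assumption; the particular values of \(a\), \(b\), \(n\), and \(i^*\) below are purely illustrative.

# Expected number of false negatives left after stopping at chunk i_star,
# under the key assumption p_i = expit(a + b * i / n).
expit = function(x) 1 / (1 + exp(-x))
fn = function(i_star, a, b, n) sum(expit(a + b * ((i_star + 1):n) / n))
fn(i_star = 200, a = -1, b = -10, n = 1000)  # illustrative values only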
Methods
Assume the human checker has already discovered the values \(\{y_1, ..., y_{i'}\}\). Is it okay to stop? That is, have we reached \(i'=i^*\)? If we estimate \(a\) and \(b\), then from the key assumption ([eq]) we immediately have the point estimate:
\[\begin{aligned} \widehat{\text{fn}}(i'; \widehat{a}, \widehat{b}) &= \sum_{i=i'+1}^n \widehat{\mathbb{E}}(y_i) \\ &= \sum_{i=i'+1}^n \widehat{p_i} \\ &= \sum_{i=i'+1}^n \text{expit}\left(\widehat{a}+\widehat{b}\cdot \frac{i}{n}\right) \end{aligned}\]
Very simply, we can use logistic regression to estimate \(a\) and \(b\). Obtain maximum likelihood point estimates (and their covariance)[3] by fitting the model:
\[\begin{aligned} y_i \sim \text{Bernoulli}(p_i),\quad \text{logit}(p_i) = a + b\cdot \frac{i}{n},\quad i=1, ..., i'. \end{aligned}\]
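As a rough sketch of this step in R (reusing expit from the sketch above; the data y here are simulated stand-ins for the checker's labels, and i_prime plays the role of \(i'\)):

# Fit the logistic regression to the first i_prime checked labels, then form
# the point estimate of false negatives among the unchecked chunks.
set.seed(1)
n = 1000; i_prime = 150
x = (1:n) / n                          # normalized rank i/n
y = rbinom(n, 1, expit(-1 - 10 * x))   # stand-in data for illustration
model = glm(y[1:i_prime] ~ x[1:i_prime], family = binomial(link = "logit"))
a_hat = coef(model)[1]; b_hat = coef(model)[2]
fn_hat = sum(expit(a_hat + b_hat * x[(i_prime + 1):n]))  # point estimate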
We can even make a predictive distribution for the false negative count, by drawing from:
\[\begin{aligned} \widehat{\text{fn}}(i'; \widehat{a}, \widehat{b}) &\sim \text{fn}(i';\ a', b'), \\ a', b' &\sim \text{Normal}\left((\widehat{a}, \widehat{b}),\ \widehat{\Sigma}(\widehat{a}, \widehat{b})\right). \end{aligned}\]
We're done checking chunks when this distribution shows an acceptable range.[4]
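Continuing the sketch above, drawing from this predictive distribution might look like the following (rmvnorm comes from the mvtnorm package, which the full script below also uses):

# Draw (a', b') from the normal approximation to the MLE and recompute the
# false-negative count for each draw.
library(mvtnorm)
B = 10000
draws = rmvnorm(B, mean = coef(model), sigma = summary(model)$cov.unscaled)
fn_draws = apply(draws, 1, function(ab) sum(expit(ab[1] + ab[2] * x[(i_prime + 1):n])))
quantile(fn_draws, c(0.5, 0.9, 0.99))  # stop if the upper tail is tolerable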
Simulation
Notice that for the \(p_i\) to be monotone decreasing, \(b\) must be negative. The crisper the ranking, the larger \(-b\) will be. Rather than think hard about \(a\) directly, I've found it more convenient to specify only a total number of cases \(m\in \{1, ..., n\}\) and use a root solver to find
\[\begin{aligned} a: \frac{1}{n}\sum_{i=1}^n\text{expit}\left(a+b\cdot \frac{i}{n}\right) - \frac{m}{n} = 0. \end{aligned}\]
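In R this is a short uniroot() call, the same approach used in the full script below; the values of \(b\) and \(m\) here are illustrative, and the bracket endpoints are just a wide search interval:

# Solve for a so that the average of the p_i equals the target prevalence m/n.
b = -10; m = 100                       # illustrative values (n, x, expit as above)
g = function(a) mean(expit(a + b * x)) - m / n
a = uniroot(g, lower = -20, upper = 1000)$root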
I've developed simulation code that's easy to modify and play with; I'll introduce the figures it makes and demonstrate a few cases here. Beyond that, playing with the parameters yourself should give you a stronger sense of what's going on.
First, let's plot the ranked \(p_i\) values for \(n=1000\). In this case, the ranker is only modestly successful: \(b=-10\). The number of cases is \(m=100\); \(a\) is whatever the root solver says it is. The red line shows how many \(y_i\) have been checked.[5]
What follows next is a histogram of draws from the predictive distribution of false negatives, \(\widehat{\text{fn}}(i'; \widehat{a}, \widehat{b})\). We'd like these numbers to be small enough to stomach. If not low enough, keep checking \(y_i\)s. I've also put a blue bar showing the true expected number of remaining cases.
Please note that the \(y\)-axis is log-scaled, and the \(x\)-axis is not constrained.
Are these numbers too high? This is for someone in charge to decide.
The next set of figures shows a progression of three checkpoints for the checker. In this simulation, the ranker is assumed to have had an easier time (\(b=-100\)).
The first row shows a clearly too-early stopping point (\(i'=m/2=50\)); the resulting false-negative distribution has an abysmally high mean and variance.
The middle row shows the situation where the checker notices that the chunks currently being checked are hit-or-miss, about 50/50 (in stark contrast to earlier checks). This actually looks like an acceptable place to stop!
The final row is overkill: they haven't seen a case in a while; they just wanted to be sure. The resulting false-negative distribution agrees: the expected number of remaining cases is a strong zero. Not one, zero.
R Code
library(mvtnorm)  # for rmvnorm()
expit = function(x) 1 / (1 + exp(-x))  # inverse logit
# colors used throughout the plots
gray = rgb(0.9, 0.9, 0.9); blue = rgb(0, .4, 1); red = rgb(1, .3, 0)
# ### set important params
n = 1000 # total chunks
n_investigated = 100 # how many have been checked
m = 100 # count of true cases
b = -100 # ranker quality (more negative is better quality)
# ### params / data that follow or aren't key for user to specify
B = 100000 # samples from estimated false negative distribution
x = seq(0, 1, length.out=n)
f = function(a) mean(expit(a + b * x)) - m/n
a = uniroot(f, lower=-20, upper=1000)$root
p = expit(a + b * x) # ranked probabilities
y = rbinom(n, 1, p) # actual sampled data
# ### plot true ranked probabilities
plot(p, type="l", ylim=c(0, 1), xlab="Rank", main=expression(paste("Ranked ", p[i])),
ylab=expression(p[i]), lwd=2, col=rgb(.2, .2, .2))
abline(v=n_investigated, lwd=2, col=red, lty=2)
legend("topright", c(expression(p[i]), expression(n[investigated])),
lwd=2, col=c(rgb(.2, .2, .2), red), lty=c(1, 2))
# ### estimate ranks from data so far
# note: because they're ranked, earlier observations are more likely to be 1s than 0s.
# for better estimation properties, I weight later observations more heavily.
weights = round(sqrt(1:n_investigated))
model <- glm(y[1:n_investigated] ~ x[1:n_investigated],
family=binomial(link='logit'), weights=weights)
# ### sample estimated false negatives
draws = rmvnorm(B, mean=model$coefficients, sigma=summary(model)$cov.unscaled)  # (a, b) draws from the normal approximation to the MLE
fn = rep(NA, B)
for(j in 1:B) fn[j] = sum(expit(draws[j, 1] + x[(n_investigated+1):n] * draws[j, 2]))  # false negatives implied by each draw
# ### plot estimated and true false negatives
plt = hist(fn, breaks=100, plot=FALSE)
plt$counts = log10(plt$counts)        # log-scale the y-axis by hand
plt$counts[plt$counts==-Inf] = 0      # empty bins: plot as zero height
y_ticks = 0:4
y_labels <- c("0", expression(10^1), expression(10^2), expression(10^3), expression(10^4))
plot(plt, yaxt="n", col=gray,
xlab="Estimated Cases Missed by Human", ylab="Simulated Counts")
axis(side = 2, at = y_ticks, labels = y_labels, las = 1)
abline(v=sum(expit(a + x[(n_investigated+1):n] * b)), col=blue, lwd=5)
legend("topright", legend = c("Simulation", "Truth"), fill = c(gray, NA),
lwd = c(NA, 4), col = c("black", blue),
border = c("black", NA), cex = 0.8)
[1] The key assumption might not be true. There's no specific reason for the \(p_i\) to take this particular form, or even to be monotone decreasing. But we have to start somewhere, and I think this assumption will be pretty good if the ranker does its job.
[2] Note that the key assumption could just as well have been \(\text{expit}(a+b\cdot i)\), but then the coefficients \(a\) and \(b\) would depend on \(n\), which is inelegant; hence the normalization.
[3] I'm happy to go into detail about how such point estimates and covariances are obtained from maximum likelihood theory, but I imagine you either know it already, or don't but assume the details have been worked out. (They have.)
[4] Caveat emptor: as with any predictive distribution, the truth is not guaranteed to be near the average; the guarantee is only that the distribution covers the truth, and only approximately so with finite samples. If you run this simulation many times, you'll see this behavior. But simulated coverage is good for \(n=1000\) (approximately our case).
[5] In the Methods section, this is referred to as \(i'\). I've called it \(n_{\text{investigated}}\) in the plot for interpretability.